38  Path Analysis and SEM - Practical

38.1 Introduction

Path analysis and Structural Equation Modeling (SEM) allow us to test hypotheses about the direct and indirect effects among observed and latent variables, but they do so in slightly different ways.

Path analysis is a precursor to SEM and is used to assess the direct and indirect relationships between measured variables. It is essentially an extension of multiple regression that allows the exploration of models with multiple dependent variables.

Path analysis is represented through diagrams that illustrate the causal relationships between variables, with arrows indicating the direction of the relationship.

So, path analysis:

  • Focuses on observed variables.

  • Deals with direct and indirect effects but does not include latent variables.

  • Uses regression coefficients to estimate the strength of the relationships.

SEM extends path analysis by including latent variables (unobserved variables that are inferred from observed variables) in addition to the observed variables. This makes SEM a more powerful and flexible technique than path analysis. SEM encompasses a broader array of models, including confirmatory factor analysis, path analysis, and others, allowing for the analysis of complex relationships between observed and latent variables.

So, SEM:

  • Incorporates both observed and latent variables.

  • Can model multiple levels of relationships simultaneously, including mediation and moderation effects.

  • Uses covariance among variables to estimate model parameters, creating a more complex view of variable interrelations.
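
To make the distinction concrete, here is a minimal sketch of how the two approaches look in lavaan model syntax (the variable names are invented for illustration; lavaan itself is introduced later in this practical):

# Path analysis: regressions among observed variables only
path_model <- '
  outcome ~ predictor1 + predictor2
'

# SEM: a latent variable (Ability) measured by observed indicators
sem_model <- '
  Ability =~ indicator1 + indicator2 + indicator3
  outcome ~ Ability
'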

38.2 Path analysis: demonstration

38.2.1 Step One: Create a dataset

rm(list=ls())
# Load necessary library
library(tidyverse)

# Set seed for reproducibility
set.seed(123)

# Generate synthetic dataset
n <- 200 # Number of observations

# Independent Variables
PossessionTime <- round(rnorm(n, mean=30, sd=5),2)
Tackles <- round(rpois(n, lambda=20),0)
SuccessfulPasses <- round(rpois(n, lambda=150),0)
TerritorialGain <- round(rnorm(n, mean=1000, sd=200),2)
PenaltiesConceded <- round(rpois(n, lambda=10),0)

# Generating PerformanceScore with associations

PerformanceScore <- round(50 + (PossessionTime * 0.5) + (SuccessfulPasses * 0.05) - (PenaltiesConceded * 1) + rnorm(n, mean=0, sd=5),2)

df <- tibble(PossessionTime, Tackles, SuccessfulPasses, TerritorialGain, PenaltiesConceded, PerformanceScore)

# Print out the first few rows to check
head(df)
# A tibble: 6 × 6
  PossessionTime Tackles SuccessfulPasses TerritorialGain PenaltiesConceded
           <dbl>   <dbl>            <dbl>           <dbl>             <dbl>
1           27.2      29              167            842.                15
2           28.8      25              137            690.                11
3           37.8      18              147           1492.                 7
4           30.4      22              140            968.                13
5           30.6      18              155            981.                 8
6           38.6      17              160           1084.                15
# ℹ 1 more variable: PerformanceScore <dbl>
rm(n, PenaltiesConceded, PerformanceScore, PossessionTime, SuccessfulPasses, Tackles, TerritorialGain)

The dataset contains data for 200 rugby games, and has been reduced to the following variables:

  • PossessionTime - Time in possession of the ball (in minutes).

  • Tackles - Number of tackles made.

  • SuccessfulPasses - Number of successful passes.

  • TerritorialGain - Total territorial gain (in meters).

  • PenaltiesConceded - Number of penalties conceded.

  • PerformanceScore - Game evaluation by three independent coaches.

38.2.2 Step Two: Exploratory Data Analysis (EDA)

The first step in Path Analysis is to define our model.

In this instance, our analysis is based on exploring the relationships between our independent variables (IVs) and the dependent variable (DV), which we have determined will be PerformanceScore.

We’ll start, as always, by conducting some EDA on the data. After some descriptive statistics, we’ll concentrate on exploring the relationships between the different variables, as this will be the focus of our subsequent Path Analysis.

# Load libraries
library(dplyr)
library(ggplot2)
library(GGally)

# Basic summary stats
summary_stats <- df %>% summary()
print(summary_stats)
 PossessionTime     Tackles   SuccessfulPasses TerritorialGain 
 Min.   :18.45   Min.   :10   Min.   :112.0    Min.   : 498.4  
 1st Qu.:26.87   1st Qu.:17   1st Qu.:142.0    1st Qu.: 879.2  
 Median :29.70   Median :20   Median :150.5    Median :1028.6  
 Mean   :29.96   Mean   :20   Mean   :150.0    Mean   :1016.5  
 3rd Qu.:32.84   3rd Qu.:23   3rd Qu.:159.0    3rd Qu.:1157.9  
 Max.   :46.21   Max.   :35   Max.   :187.0    Max.   :1538.3  
 PenaltiesConceded PerformanceScore
 Min.   : 3.00     Min.   :48.66   
 1st Qu.: 8.00     1st Qu.:58.83   
 Median :10.00     Median :62.44   
 Mean   : 9.93     Mean   :62.48   
 3rd Qu.:12.00     3rd Qu.:65.69   
 Max.   :20.00     Max.   :76.46   

I’m going to use a correlation matrix to explore the relationships within the data:

library(corrplot)
corrplot 0.92 loaded
cor_matrix <- cor(df) # create the correlation matrix
cov_matrix <- cov(df) # create the covariance matrix

# Visualise  correlation matrix
corrplot(cor_matrix, method = "number")

I’ll create a new variable PerformanceCategory based on how the game was scored. This allocates ‘high’ or ‘low’, depending on whether PerformanceScore is above or below the median.

# Assuming df is your dataframe and PerformanceScore is the continuous variable

# Calculate the median of PerformanceScore for the threshold
median_score <- median(df$PerformanceScore)

# Create a new variable based on PerformanceScore being high or low
df$PerformanceCategory <- ifelse(df$PerformanceScore > median_score, 'high', 'low')

# Check the first few rows to verify the new variable
head(df)
# A tibble: 6 × 7
  PossessionTime Tackles SuccessfulPasses TerritorialGain PenaltiesConceded
           <dbl>   <dbl>            <dbl>           <dbl>             <dbl>
1           27.2      29              167            842.                15
2           28.8      25              137            690.                11
3           37.8      18              147           1492.                 7
4           30.4      22              140            968.                13
5           30.6      18              155            981.                 8
6           38.6      17              160           1084.                15
# ℹ 2 more variables: PerformanceScore <dbl>, PerformanceCategory <chr>

As we now have a binary version of our DV (‘high’ or ‘low’), we can use boxplots and density plots to examine the apparent impact of each IV on the outcome.

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Creating a list to store plots
plots_list <- list()

# Iterate over each independent variable to create plots
for (var in names(df)[1:5]) { # Loop over the five IVs only
  # Boxplot
  boxplot <- ggplot(df, aes(x = factor(PerformanceCategory), y = .data[[var]], fill = factor(PerformanceCategory))) +
             geom_boxplot() +
             labs(title = paste("Boxplot of", var, "by PerformanceCategory"),
                  x = "Performance Category",
                  y = var,
                  fill = "Performance Category") +
             theme_minimal()
  
  # Density plot
  density_plot <- ggplot(df, aes(x = .data[[var]], fill = factor(PerformanceCategory))) +
                  geom_density(alpha = 0.7) +
                  labs(title = paste("Density Plot of", var, "by PerformanceCategory"),
                       x = var,
                       y = "Density",
                       fill = "Performance Category") +
                  theme_minimal()

  # Store plots in the list
  plots_list[[paste("Boxplot", var)]] <- boxplot
  plots_list[[paste("Density", var)]] <- density_plot
}
# Print the plots
for (plot_name in names(plots_list)) {
  print(plots_list[[plot_name]])
}

rm(boxplot, density_plot, plots_list)

Notice that there are some interesting associations here: for example, ‘high’-scoring games tend to involve more possession time and successful passes, and fewer penalties conceded.

38.2.3 Step Three: Conducting the Path Analysis

We can now move on to the Path Analysis. We’re going to use the package lavaan, which is an R package designed for Path Analysis and SEM.

In the code, I’m going to specify a simple model in which all the IVs directly predict the DV. There are no mediating variables hypothesised in the model.

library(lavaan)

# Define our path analysis model
model <- '
  PerformanceScore ~ PossessionTime + Tackles + SuccessfulPasses + TerritorialGain + PenaltiesConceded
'
# Fit model
fit <- sem(model, data=df, estimator="MLR") # MLR: maximum likelihood with robust (Huber-White) standard errors

# Summarise results
summary(fit, standardized = TRUE, fit.measures = TRUE)
lavaan 0.6.17 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         6

  Number of observations                           200

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                                74.414      77.298
  Degrees of freedom                                 5           5
  P-value                                        0.000       0.000
  Scaling correction factor                                  0.963

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                            NA
  Robust Tucker-Lewis Index (TLI)                               NA

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)               -588.936    -588.936
  Loglikelihood unrestricted model (H1)       -588.936    -588.936
                                                                  
  Akaike (AIC)                                1189.872    1189.872
  Bayesian (BIC)                              1209.662    1209.662
  Sample-size adjusted Bayesian (SABIC)       1190.653    1190.653

Root Mean Square Error of Approximation:

  RMSEA                                          0.000          NA
  90 Percent confidence interval - lower         0.000          NA
  90 Percent confidence interval - upper         0.000          NA
  P-value H_0: RMSEA <= 0.050                       NA          NA
  P-value H_0: RMSEA >= 0.080                       NA          NA
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000
  P-value H_0: Robust RMSEA <= 0.050                            NA
  P-value H_0: Robust RMSEA >= 0.080                            NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Regressions:
                     Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
  PerformanceScore ~                                                        
    PossessionTime       0.310    0.070    4.452    0.000     0.310    0.263
    Tackles             -0.114    0.070   -1.635    0.102    -0.114   -0.094
    SuccessfulPsss       0.063    0.024    2.622    0.009     0.063    0.136
    TerritorialGan       0.000    0.001    0.260    0.795     0.000    0.015
    PenaltiesCncdd      -0.791    0.112   -7.047    0.000    -0.791   -0.458

Variances:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
   .PerformanceScr    21.147    2.054   10.297    0.000    21.147    0.689

38.2.4 Step Four: Interpreting the Model

There is a lot of information here! I’ll focus on the key aspects.

At the bottom, you can see a table of path coefficients (‘Regressions’):

  • Coefficients (‘Estimate’): These are the estimated values that show the strength and direction of the relationships between variables. A positive coefficient indicates a positive relationship, while a negative coefficient indicates a negative relationship.
  • Significance (‘P(>|z|)’): Look at the p-values associated with each path coefficient. A p-value less than a conventional threshold (usually 0.05) indicates that the path coefficient is statistically significant. For our current analysis, note that PossessionTime, SuccessfulPasses and PenaltiesConceded are significant predictors of PerformanceScore, while Tackles and TerritorialGain are not.

  • Standard Errors (‘Std.Err’): These provide an estimate of the variability in the path coefficients and are useful for understanding the precision of the estimates.
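
If you prefer to work with these numbers directly, lavaan’s parameterEstimates() function returns the same table as a data frame, which is easier to filter and sort. A small sketch, assuming the fit object created above:

# Path coefficients as a data frame (rows with op == "~" are regressions)
est <- parameterEstimates(fit, standardized = TRUE)
est[est$op == "~", c("lhs", "rhs", "est", "se", "pvalue", "std.all")]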

Above that table, you will see various statistics that report model fit. These let us evaluate how well the model fits the data. Note, however, that our current model is saturated (zero degrees of freedom), so these indices are necessarily perfect here; they only become informative when a model imposes restrictions on the data.

  • Comparative Fit Index (CFI): Values above 0.90 or 0.95 suggest a good fit.

  • Tucker-Lewis Index (TLI): Like the CFI, values above 0.90 or 0.95 are typically considered indicative of a good fit.

  • Root Mean Square Error of Approximation (RMSEA): Values less than 0.05 indicate a close fit, and values up to 0.08 represent a reasonable error of approximation.

  • Standardized Root Mean Square Residual (SRMR): Values less than 0.08 are generally considered good.

38.2.4.1 Information Criteria (AIC and BIC)

The output also provides two information criteria (AIC and BIC), which help in assessing the model from a different perspective. They are not particularly useful by themselves, but they help you to compare competing models.

They lie outside the scope of this practical, but a general rule of thumb is that ‘lower is better’. In other words, a lower AIC for Model 2 compared with Model 1 suggests that Model 2 is ‘better’ at explaining the data, compared with Model 1.
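
As a sketch of how such a comparison might look with the objects created above, we could drop the two non-significant predictors and compare information criteria:

# Reduced model: drop the non-significant predictors
model_reduced <- '
  PerformanceScore ~ PossessionTime + SuccessfulPasses + PenaltiesConceded
'
fit_reduced <- sem(model_reduced, data = df, estimator = "MLR")

AIC(fit)          # saturated model
AIC(fit_reduced)  # reduced model: lower values are preferred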

38.2.5 Step Five: Visualising the Model

We can produce a visual representation of our model using the semPlot library.

# Load necessary libraries

library(semPlot)

# Visualize the path diagram
semPaths(fit, whatLabels="est", edge.label.cex = 0.75, layout="tree")

In summary:

  • We have created a path model where PerformanceScore is regressed on all other variables (PossessionTime, Tackles, SuccessfulPasses, TerritorialGain, PenaltiesConceded).

  • The sem() function from the lavaan package is used to fit the model.

  • The semPaths() function from the semPlot package creates a visual representation of the fitted model.

The semPlot library offers a variety of options to customize the appearance of the path diagram. For example, whatLabels can be set to “est” to display path coefficients, “std” for standardized coefficients, or “both” for both. The layout and rotation parameters control the arrangement and orientation of the path diagram.
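
For instance, the diagram above could be redrawn with standardised coefficients, a rotated layout and unabbreviated variable names. A sketch using a few of semPlot’s optional arguments:

# Standardised coefficients, rotated tree layout, full variable names
semPaths(fit, whatLabels = "std", layout = "tree", rotation = 2,
         edge.label.cex = 0.75, sizeMan = 8, nCharNodes = 0)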

38.3 Path Analysis: Practice

Replicate all of the steps above on the dataset available here:

rm(list=ls())
swimming_data <- read.csv('https://www.dropbox.com/scl/fi/kmzsi84wltvr9hycwnr3z/31_02.csv?rlkey=7bml7ukawcygwotuy6aemy4fr&dl=1')

swimming_data$X <- NULL

head(swimming_data) # display the first six rows
  age training_hours experience_years diet_quality stamina average_speed
1  20             12                6            2     5.8          0.04
2  20              7                9            5     3.3          0.99
3  27             14                8            9    10.3          1.62
4  19              6               10           10    11.6          1.66
5  23             14                4            6    12.1          0.73
6  28             16                5           10    12.4          2.26

There are two options:

  • The more straightforward option is to conduct an analysis that assumes a direct relationship between several IVs and one DV. This is similar to the previous example. For example, you may wish to model the influence of some or all of the variables on the dependent variable average_speed.

  • Make sure to conduct EDA first, then create the statistical model, then produce a visual output of your analysis.

  • A more complex option is to explore the possibility of a mediating variable within the data. For example, you might hypothesise that stamina mediates the influence of the other IVs on performance.

If you carry out both analyses, what do you notice about the AIC? Is the mediating model a better fit with the data than the one that only models direct effects?

If you are interested in exploring path analysis further, you may wish to investigate how to include both the mediating and direct effects of variables.
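
One possible starting point is lavaan’s labelling syntax, which lets you name individual paths and define indirect and total effects explicitly. A sketch, using training_hours as the focal predictor (adapt the variables to your own hypothesis):

# Label the paths, then define indirect and total effects with :=
model_mediation <- '
  average_speed ~ c*training_hours      # direct effect
  stamina ~ a*training_hours            # predictor -> mediator
  average_speed ~ b*stamina             # mediator -> outcome
  indirect := a*b
  total := c + (a*b)
'
fit_mediation <- sem(model_mediation, data = swimming_data)
summary(fit_mediation)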

A possible solution for the mediating-variable model is shown below.
# Load lavaan package
library(lavaan)

# Define the model (specifying the paths)
model <- '
  # Direct paths
  average_speed ~ experience_years + age + stamina
  # Mediation paths
  stamina ~ training_hours + diet_quality
'

# Fit model
fit <- sem(model, data = swimming_data)

# Summarise results
summary(fit, standardized = TRUE, fit.measures = TRUE)
lavaan 0.6.17 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         7

  Number of observations                           100

Model Test User Model:
                                                      
  Test statistic                                 3.820
  Degrees of freedom                                 4
  P-value (Chi-square)                           0.431

Model Test Baseline Model:

  Test statistic                               135.150
  Degrees of freedom                                 9
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.003

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)               -279.783
  Loglikelihood unrestricted model (H1)       -277.873
                                                      
  Akaike (AIC)                                 573.567
  Bayesian (BIC)                               591.803
  Sample-size adjusted Bayesian (SABIC)        569.695

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.148
  P-value H_0: RMSEA <= 0.050                    0.554
  P-value H_0: RMSEA >= 0.080                    0.298

Standardized Root Mean Square Residual:

  SRMR                                           0.039

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  average_speed ~                                                       
    experience_yrs    0.030    0.016    1.879    0.060    0.030    0.180
    age               0.001    0.015    0.076    0.939    0.001    0.007
    stamina           0.035    0.014    2.574    0.010    0.035    0.248
  stamina ~                                                             
    training_hours    0.522    0.045   11.663    0.000    0.522    0.635
    diet_quality      0.662    0.064   10.331    0.000    0.662    0.563

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .average_speed     0.239    0.034    7.071    0.000    0.239    0.905
   .stamina           3.864    0.546    7.071    0.000    3.864    0.297
# Create the path diagram
library(semPlot) # needed for semPaths()
semPaths(fit, whatLabels = "est", layout = "tree", edge.label.cex = 0.8)

38.4 SEM: Demonstration

We use Structural Equation Modeling (SEM) when we suspect there may be underlying (‘latent’) variables present in our data, similar to the concept introduced in our section on Factor Analysis.

SEM lets us explore this kind of complex model using a combination of factor analysis and regression.

As with everything we’re covering, this is just intended to be a basic introduction to the topic!

38.4.1 Step One: Load data

rm(list=ls())

set.seed(123) # Ensure reproducibility

# Number of observations
n <- 500

# Simulating independent variables
player_age <- round(rnorm(n, 25, 4),0) # Player age in years
years_in_NFL <- round(rnorm(n, 3, 2),0) # Years in NFL
games_played <- round(rpois(n, 16),0) # Games played in a season
total_yards <- round(rnorm(n, 500, 150),0) # Total yards
touchdowns <- round(rpois(n, 5),0) # Touchdowns
interceptions <- round(rpois(n, 2),0) # Interceptions
tackles <- rpois(n, 40) # Tackles made
sacks <- rpois(n, 5) # Sacks

# Simulating latent variables "Performance" and "Experience"
# For simulation purposes, we will create proxies for these latent constructs.
performance_proxy <- total_yards + touchdowns*10 - interceptions*5 + tackles + sacks*2
experience_proxy <- player_age + years_in_NFL*2

# Normalise proxies to have them on a similar scale
performance_proxy <- scale(performance_proxy)
experience_proxy <- scale(experience_proxy)

# Simulating "Player Value" influenced by "Performance" and "Experience"
player_value <- round(5*performance_proxy + 3*experience_proxy + rnorm(n, 0, 1),2) # Add noise

# Create dataframe
data <- data.frame(player_age, years_in_NFL, games_played, total_yards, touchdowns, interceptions, tackles, sacks, player_value)

# View the first few rows of the dataset
head(data)
  player_age years_in_NFL games_played total_yards touchdowns interceptions
1         23            2           12         512          9             2
2         24            1           16         504          6             1
3         31            5           17         536          4             2
4         25            5            9         549          4             4
5         26            0           20         525          1             5
6         32            3           16         711          5             1
  tackles sacks player_value
1      34     3        -0.97
2      44     4        -0.90
3      33    10         5.55
4      33    11         3.12
5      49     4        -1.54
6      30     5         8.51
rm(experience_proxy, performance_proxy, player_value, games_played, player_age, n, sacks, tackles, total_yards, touchdowns, years_in_NFL, interceptions)

38.4.2 Step Two: Exploratory data analysis

# Load necessary libraries

library(dplyr)
library(ggplot2)
library(lavaan)

# Exploratory Data Analysis

library(corrplot)
cor_matrix <- cor(data) # create the correlation matrix
cov_matrix <- cov(data) # create the covariance matrix

# Visualise  correlation matrix
corrplot(cor_matrix, method = "number")

rm(cor_matrix, cov_matrix)

38.4.3 Step Three: Specify the Structural Equation Model (SEM)

As for path analysis, we need to define the model.

The dependent variable we are investigating is player_value. We could first create a path analysis that looks at the direct relationships between the observed variables.

library(lavaan)

# Define our path analysis model
model <- '
  player_value ~ total_yards + touchdowns + interceptions + tackles + sacks + player_age + years_in_NFL
'
# Fit model
fit <- sem(model, data=data, estimator="MLR") # MLR: maximum likelihood with robust (Huber-White) standard errors

# Summarise results
summary(fit, standardized = TRUE, fit.measures = TRUE)
lavaan 0.6.17 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         8

  Number of observations                           500

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                              1826.989    1822.431
  Degrees of freedom                                 7           7
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.003

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                            NA
  Robust Tucker-Lewis Index (TLI)                               NA

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)               -697.307    -697.307
  Loglikelihood unrestricted model (H1)       -697.307    -697.307
                                                                  
  Akaike (AIC)                                1410.615    1410.615
  Bayesian (BIC)                              1444.332    1444.332
  Sample-size adjusted Bayesian (SABIC)       1418.939    1418.939

Root Mean Square Error of Approximation:

  RMSEA                                          0.000          NA
  90 Percent confidence interval - lower         0.000          NA
  90 Percent confidence interval - upper         0.000          NA
  P-value H_0: RMSEA <= 0.050                       NA          NA
  P-value H_0: RMSEA >= 0.080                       NA          NA
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000
  P-value H_0: Robust RMSEA <= 0.050                            NA
  P-value H_0: Robust RMSEA >= 0.080                            NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Regressions:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
  player_value ~                                                          
    total_yards        0.032    0.000  104.604    0.000     0.032    0.808
    touchdowns         0.334    0.019   17.953    0.000     0.334    0.119
    interceptions     -0.196    0.032   -6.029    0.000    -0.196   -0.046
    tackles            0.035    0.007    4.880    0.000     0.035    0.035
    sacks              0.064    0.021    3.041    0.002     0.064    0.023
    player_age         0.527    0.011   47.840    0.000     0.527    0.340
    years_in_NFL       1.104    0.022   50.947    0.000     1.104    0.369

Variances:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
   .player_value       0.953    0.056   17.127    0.000     0.953    0.026
# Create the path diagram
library(semPlot) # needed for semPaths()
semPaths(fit, whatLabels = "est", layout = "tree", edge.label.cex = 0.8)

rm(model,fit)

Unlike path analysis, which deals only with observed variables, model-building in SEM involves specifying any hypothesised latent variables and the relationships among them.

Remember: that’s the reason we use SEM rather than path analysis.

For simplicity, let’s assume a model where Performance is a latent variable indicated by total_yards and touchdowns, while a second latent variable, Experience, is indicated by player_age and years_in_NFL.

library(lavaan)

# Define the SEM model
model <- '
  # Latent variable definitions
  Performance =~ total_yards + touchdowns
  Experience =~ player_age + years_in_NFL
  
  # Outcome variable
  PlayerValue =~ player_value
  
  # Relationships
  PlayerValue ~ Performance + Experience
'

38.4.4 Step Four: Fit the Model

I’ll use the lavaan package to estimate the model.

Before doing so, I’m going to standardise my variables, which can be helpful if different variables are measured on very different scales.

# Scale the variables

data$player_age <- scale(data$player_age)
data$years_in_NFL <- scale(data$years_in_NFL)
data$games_played <- scale(data$games_played)
data$total_yards <- scale(data$total_yards)
data$touchdowns <- scale(data$touchdowns)
data$interceptions <- scale(data$interceptions)
data$tackles <- scale(data$tackles)
data$sacks <- scale(data$sacks)
data$player_value <- scale(data$player_value)
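# (A more compact alternative using dplyr would be, as a sketch:
#  data <- data %>% mutate(across(everything(), ~ as.numeric(scale(.x)))) )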

# Fit the model
fit <- sem(model, data = data, estimator = "MLR", control = list(maxit = 15000)) # Increase max iterations to 15,000

# Model summary
summary(fit, fit.measures = TRUE)
lavaan 0.6.17 ended normally after 1169 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        12

  Number of observations                           500

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 1.438       1.587
  Degrees of freedom                                 3           3
  P-value (Chi-square)                           0.697       0.662
  Scaling correction factor                                  0.906
    Yuan-Bentler correction (Mplus variant)                       

Model Test Baseline Model:

  Test statistic                              1763.412    1856.167
  Degrees of freedom                                10          10
  P-value                                        0.000       0.000
  Scaling correction factor                                  0.950

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.003       1.003
                                                                  
  Robust Comparative Fit Index (CFI)                         1.000
  Robust Tucker-Lewis Index (TLI)                            1.002

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2663.857   -2663.857
  Scaling correction factor                                  1.033
      for the MLR correction                                      
  Loglikelihood unrestricted model (H1)      -2663.138   -2663.138
  Scaling correction factor                                  1.007
      for the MLR correction                                      
                                                                  
  Akaike (AIC)                                5351.714    5351.714
  Bayesian (BIC)                              5402.289    5402.289
  Sample-size adjusted Bayesian (SABIC)       5364.200    5364.200

Root Mean Square Error of Approximation:

  RMSEA                                          0.000       0.000
  90 Percent confidence interval - lower         0.000       0.000
  90 Percent confidence interval - upper         0.056       0.062
  P-value H_0: RMSEA <= 0.050                    0.926       0.899
  P-value H_0: RMSEA >= 0.080                    0.008       0.013
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.056
  P-value H_0: Robust RMSEA <= 0.050                         0.926
  P-value H_0: Robust RMSEA >= 0.080                         0.006

Standardized Root Mean Square Residual:

  SRMR                                           0.014       0.014

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  Performance =~                                      
    total_yards       1.000                           
    touchdowns        0.216    0.048    4.523    0.000
  Experience =~                                       
    player_age        1.000                           
    years_in_NFL      1.092    0.108   10.081    0.000
  PlayerValue =~                                      
    player_value      1.000                           

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  PlayerValue ~                                       
    Performance       3.038    1.210    2.512    0.012
    Experience       -4.406    3.605   -1.222    0.222

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
  Performance ~~                                      
    Experience        0.041    0.028    1.448    0.148

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .total_yards       0.660    0.144    4.577    0.000
   .touchdowns        0.982    0.078   12.527    0.000
   .player_age        1.050    0.075   13.920    0.000
   .years_in_NFL      1.060    0.083   12.788    0.000
   .player_value      0.000                           
    Performance       0.338    0.139    2.437    0.015
    Experience       -0.052    0.040   -1.290    0.197
   .PlayerValue      -0.025    1.594   -0.016    0.987

38.4.5 Step Five: Inspect the model

Interpreting the output from a structural equation model (SEM) fitted with the lavaan package in R involves understanding its various components. Here’s a guide to some of the key elements you’ll find in the output:

Model Estimation Summary:

  • Estimates: This section shows the estimated parameters of the model. In SEM, these include factor loadings (relationships between observed variables and latent variables), regression weights (relationships among latent variables), and variances and covariances of the variables.

  • Factor Loadings: Indicate how strongly each observed variable represents the latent variable. Higher absolute values indicate a stronger relationship.

  • Regression Weights: Show the direction and strength of the relationships between latent variables.

  • Significance: Usually indicated by p-values. A p-value less than 0.05 typically suggests that the parameter is significantly different from zero, lending support to that part of the model.
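
A small sketch of how you might pull out just the factor loadings from the fitted object (assuming the fit object created above):

# Factor loadings as a data frame (rows with op == "=~")
loadings <- parameterEstimates(fit, standardized = TRUE)
loadings[loadings$op == "=~", c("lhs", "rhs", "est", "std.all")]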

Fit Indices: The output provides various fit indices, which help in evaluating how well the model fits the data. Common indices include:

  • Chi-Square Test of Model Fit (χ²): A non-significant chi-square (p > 0.05) suggests a good fit. However, it’s sensitive to sample size.

  • Comparative Fit Index (CFI): Values closer to 1 indicate a good fit, with values above 0.90 or 0.95 often considered indicative of a good fit.

  • Tucker-Lewis Index (TLI): Similar to CFI, with values close to 1 suggesting a good fit. Values above 0.90 are typically considered good.

  • Root Mean Square Error of Approximation (RMSEA): Values less than 0.05 indicate a close fit, values up to 0.08 represent a reasonable error of approximation, and values greater than 0.10 suggest a poor fit.

  • Standardized Root Mean Square Residual (SRMR): Represents the difference between the observed correlation and the model-predicted correlation. Values less than 0.08 are generally considered good.

38.4.6 Step Six: Evaluate Model Fit

Check various fit indices like CFI, TLI, RMSEA to evaluate the model.

  • Comparative Fit Index (CFI): Values above 0.90 or 0.95 suggest a good fit.

  • Tucker-Lewis Index (TLI): Like the CFI, values above 0.90 or 0.95 are typically considered indicative of a good fit.

  • Root Mean Square Error of Approximation (RMSEA): Values less than 0.05 indicate a close fit, and values up to 0.08 represent a reasonable error of approximation.

  • Standardized Root Mean Square Residual (SRMR): Values less than 0.08 are generally considered good.

# Fit indices
fitMeasures(fit)
                         npar                          fmin 
                       12.000                         0.001 
                        chisq                            df 
                        1.438                         3.000 
                       pvalue                  chisq.scaled 
                        0.697                         1.587 
                    df.scaled                 pvalue.scaled 
                        3.000                         0.662 
         chisq.scaling.factor                baseline.chisq 
                        0.906                      1763.412 
                  baseline.df               baseline.pvalue 
                       10.000                         0.000 
        baseline.chisq.scaled            baseline.df.scaled 
                     1856.167                        10.000 
       baseline.pvalue.scaled baseline.chisq.scaling.factor 
                        0.000                         0.950 
                          cfi                           tli 
                        1.000                         1.003 
                   cfi.scaled                    tli.scaled 
                        1.000                         1.003 
                   cfi.robust                    tli.robust 
                        1.000                         1.002 
                         nnfi                           rfi 
                        1.003                         0.997 
                          nfi                          pnfi 
                        0.999                         0.300 
                          ifi                           rni 
                        1.001                         1.001 
                  nnfi.scaled                    rfi.scaled 
                        1.003                         0.997 
                   nfi.scaled                   pnfi.scaled 
                        0.999                         0.300 
                   ifi.scaled                    rni.scaled 
                        1.001                         1.001 
                  nnfi.robust                    rni.robust 
                        1.002                         1.001 
                         logl             unrestricted.logl 
                    -2663.857                     -2663.138 
                          aic                           bic 
                     5351.714                      5402.289 
                       ntotal                          bic2 
                      500.000                      5364.200 
            scaling.factor.h1             scaling.factor.h0 
                        1.007                         1.033 
                        rmsea                rmsea.ci.lower 
                        0.000                         0.000 
               rmsea.ci.upper                rmsea.ci.level 
                        0.056                         0.900 
                 rmsea.pvalue                rmsea.close.h0 
                        0.926                         0.050 
        rmsea.notclose.pvalue             rmsea.notclose.h0 
                        0.008                         0.080 
                 rmsea.scaled         rmsea.ci.lower.scaled 
                        0.000                         0.000 
        rmsea.ci.upper.scaled           rmsea.pvalue.scaled 
                        0.062                         0.899 
 rmsea.notclose.pvalue.scaled                  rmsea.robust 
                        0.013                         0.000 
        rmsea.ci.lower.robust         rmsea.ci.upper.robust 
                        0.000                         0.056 
          rmsea.pvalue.robust  rmsea.notclose.pvalue.robust 
                        0.926                         0.006 
                          rmr                    rmr_nomean 
                        0.014                         0.014 
                         srmr                  srmr_bentler 
                        0.014                         0.014 
          srmr_bentler_nomean                          crmr 
                        0.014                         0.018 
                  crmr_nomean                    srmr_mplus 
                        0.018                         0.014 
            srmr_mplus_nomean                         cn_05 
                        0.014                      2718.850 
                        cn_01                           gfi 
                     3946.581                         0.999 
                         agfi                          pgfi 
                        0.994                         0.200 
                          mfi                          ecvi 
                        1.002                         0.051 
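
If you only need the headline indices, you can pass their names to fitMeasures() rather than printing the full list:

# Just the headline fit indices
fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))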

38.4.7 Step Seven: Visual Output

We can use the semPlot package for visual representation.

library(semPlot)

# SEM plot
semPaths(fit, whatLabels = "est", layout = "tree")

38.5 SEM: Practice

Repeat the process from the demonstration.

This time, build an alternative model: for example, one where Experience is indicated by player_value and years_in_NFL, and Performance is indicated by tackles and sacks. You could use touchdowns as your dependent (outcome) variable.
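
A minimal sketch of how that alternative specification might look in lavaan syntax, as a starting point only (you will need to check the model’s identification and fit yourself):

library(lavaan)

# Alternative measurement and structural model (hypothesised structure)
model_alt <- '
  Experience =~ player_value + years_in_NFL
  Performance =~ tackles + sacks
  touchdowns ~ Experience + Performance
'
fit_alt <- sem(model_alt, data = data, estimator = "MLR")
summary(fit_alt, fit.measures = TRUE, standardized = TRUE)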